Antipattern: Reference Non-grouped Columns
Let’s explore how misconception about the GROUP BY clause causes an antipattern by testing out different queries.
We'll cover the following
The root cause of this antipattern is simple, and it reveals a common misconception that many programmers have about how grouping queries work in SQL.
The single-value rule#
The rows in each group consist only of items that have the same value in the column or columns we name after GROUP BY
. For example, in the following query, there is a one-row group for each distinct value in product_id
.
Every column in the select list of a query must have a single-value row per row group. This is called the Single-Value Rule. Columns named in the GROUP BY
clause are guaranteed to be exactly one value per group, no matter how many rows the group matches.
The MAX()
expression is also guaranteed to result in a single value for each group: the highest value found in the argument of MAX()
from among all the rows in the group.
However, the database server can’t be as certain about any other column named in the select list. It can’t always guarantee that the same value occurs in every row in a group for the other columns.
In this example, there are many distinct values for bug_id
for a given product_id
because the BugsProducts
table associates multiple bugs to a given product. There’s no way to represent all the values of bug_id
in a grouping query that reduces it to a single row per product.
Since there is no guarantee of a single value per group in the “extra” columns, the database assumes that they violate the Single-Value Rule. Most brands of databases report an error if we run a query that tries to return a column other than the columns named either in the GROUP BY
clause or as arguments to aggregate functions.
MySQL and SQLite have different behavior from other brands of the database, which we have studied in the Legitimate uses of the antipattern section.
Do-what-I-mean queries#
The common misconception that programmers have is that SQL can guess which bug_id
we want in the report because MAX()
is used in another column. Most people assume that if the query fetches the greatest value, then other columns will naturally take their value from the same row where the greatest value occurs.
Unfortunately, SQL can’t make this inference in several cases:
-
If two bugs have the exact same value for
date_reported
and that is the greatest value in the group, which value ofbug_id
should the query report? -
If we query for two different aggregate functions — for example,
MAX()
andMIN()
— these probably correspond to two different rows in the group. Whichbug_id
should the query return for this group?
- If none of the rows in the table matches the value returned by the aggregate function, what is the value of
bug_id
? This commonly occurs for the functionsAVG()
,COUNT()
, andSUM()
.
These are examples of why the Single-Value Rule is important. Not every query that fails to follow this rule would produce an ambiguous result, but many do. It would be clever if the database could tell an ambiguous query from an unambiguous one and produce an error only when the data contains ambiguity. But that would not be good for application reliability. It would mean that the same query might be valid or invalid, depending on the state of data.